import pandas as pd
eda = pd.read_parquet("data/eda.parquet")

# Identify data analyst jobs by keyword search
keywords = ['Data Analyst', 'Business Analyst', 'Data Engineering', 'Deep Learning',
            'Data Science', 'Data Analysis', 'Data Analytics', 'Market Research Analyst',
            'LLM', 'Language Model', 'NLP', 'Natural Language Processing',
            'Computer Vision', 'Business Intelligence Analyst', 'Quantitative Analyst', 'Operations Analyst']
# Flag a posting when any keyword appears in its title or skills fields
match = lambda col: eda[col].str.contains('|'.join(keywords), case=False, na=False)
eda['DATA_ANALYST_JOB'] = (
    match('TITLE_NAME')
    | match('SKILLS_NAME')
    | match('SPECIALIZED_SKILLS_NAME')
)
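# Note: str.contains treats the joined keywords as one regex, so a future
# keyword containing a metacharacter (e.g. 'C++') would silently change the
# match. A safer variant of the same search, escaping each term:
import re
pattern = '|'.join(re.escape(k) for k in keywords)
match = lambda col: eda[col].str.contains(pattern, case=False, na=False)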
# eda['DATA_ANALYST_JOB'].value_counts()

import plotly.graph_objects as go
from plotly.subplots import make_subplots
df_grouped = (
    eda
    .groupby(['DATA_ANALYST_JOB', 'NAICS2_NAME'])
    .size()
    .reset_index(name='Job_Count')
)
short_names = {
'Professional, Scientific, and Technical Services': 'Tech. Services',
'Administrative and Support and Waste Management and Remediation Services': 'Admin & Waste Mgmt',
'Health Care and Social Assistance': 'Healthcare',
'Finance and Insurance': 'Finance',
'Information': 'Info Tech',
'Educational Services': 'Education',
'Manufacturing': 'Manufacturing',
'Retail Trade': 'Retail',
'Accommodation and Food Services': 'Hospitality',
'Other Services (except Public Administration)': 'Other Services'
}
df_grouped['Industry'] = df_grouped['NAICS2_NAME'].map(short_names).fillna(df_grouped['NAICS2_NAME'])
df_grouped['Job_Type'] = df_grouped['DATA_ANALYST_JOB'].map({True:'True', False:'False'})
pivot = (
    df_grouped
    .pivot_table(index='Industry', columns='Job_Type', values='Job_Count', fill_value=0)
    .reset_index()
)
industries = pivot['Industry'].tolist()
y_true = pivot['True'].tolist()
y_false = pivot['False'].tolist()
# 2) Build a 2-row subplot: bars on top, with row 2 reserved for a table
#    (not populated in this cell; see the sketch after the bar traces)
fig = make_subplots(
rows=2, cols=1,
row_heights=[0.70, 0.30], # give a bit more room to the table
specs=[[{"type":"bar"}],[{"type":"table"}]],
vertical_spacing=0.12 # more space between bar and table
)
colors = {'True': '#FFE5E5', 'False': '#FF6B6B'}
fig.add_trace(
go.Bar(
x=industries, y=y_true, name='True',
marker=dict(color=colors['True'], line=dict(color='#A81D1D', width=1)),
text=y_true, textposition='outside'
),
row=1, col=1
)
fig.add_trace(
go.Bar(
x=industries, y=y_false, name='False',
marker=dict(color=colors['False'], line=dict(color='#A81D1D', width=1)),
text=y_false, textposition='outside'
),
row=1, col=1
)
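# The spec above reserves row 2 for a table, but no table trace is added in
# this cell; a minimal sketch to populate it, assuming the pivot data above:
# fig.add_trace(
#     go.Table(
#         header=dict(values=['Industry', 'True', 'False']),
#         cells=dict(values=[industries, y_true, y_false])
#     ),
#     row=2, col=1
# )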
# 3) Slider steps: thresholds from 0 to 8,000 in increments of 200.
#    Each step re-sends both traces' y arrays, zeroing any bar below the threshold.
steps = []
for val in range(0, 8001, 200):
    steps.append(dict(
        label=str(val),
        method="update",
        args=[
            {"y": [
                [v if v >= val else 0 for v in y_true],
                [v if v >= val else 0 for v in y_false]
            ]}
        ]
    ))
# 4) Final layout tweaks
fig.update_layout(
# lift slider above everything
sliders=[dict(
active=0,
currentvalue={"prefix":"Min Jobs: "},
pad={"b":0},
x=0.05,
y=1.05, # move slider way above the plot area
xanchor="left",
yanchor="bottom",
len=0.7,
font=dict(color='#A81D1D'),
steps=steps
)],
title=dict(
text="Data & Business Analytics Job Trends",
font=dict(size=24, color='#A81D1D'),
x=0.5,
y=0.95, # drop the title just below the slider
xanchor="center",
yanchor="top"
),
width=1100, height=850,
margin=dict(l=60, r=60, t=180, b=200), # extra top & bottom margin
plot_bgcolor='white',
paper_bgcolor='white',
xaxis=dict(
title="Industry",
title_font=dict(size=16, color='#A81D1D'),
tickmode='array',
tickvals=list(range(len(industries))),
ticktext=industries,
tickangle=-30,
tickfont=dict(size=11, color='#333'),
showline=True, linecolor='#A81D1D'
),
yaxis=dict(
title="Number of Jobs",
title_font=dict(size=16, color='#A81D1D'),
tickfont=dict(size=11, color='#333'),
gridcolor='rgba(200,200,200,0.3)',
showline=True, linecolor='#A81D1D',
range=[0, max(max(y_true),max(y_false))*1.2]
),
legend=dict(
title="Data Analyst Job",
title_font=dict(color='#A81D1D'),
font=dict(size=12),
x=0.95, y=0.95
),
bargap=0.2
)
fig.write_html(
"figures/edaplot1.html",
include_plotlyjs="cdn", # Use CDN to load Plotly JS
full_html=False # Only include the plot div
)

Tech. Services dominates: with roughly 7,620 analytics roles versus 15,550 non-analytics roles, this sector is by far the biggest home for Data Analyst jobs, both in analytics counts and total listings.
Info Tech is the most analytics-centric: the split is nearly 50/50 (1,970 analytics vs. 1,855 non-analytics), suggesting analytics is core to many IT functions, not just a niche add-on.
Finance and Unclassified Industry are a close second and third: Finance shows about 4,246 analytics positions against 5,860 non-analytics, while Unclassified Industry is similarly high (4,148 vs. 5,256). Both clearly lean heavily on analytics but still carry large non-analytics wings.
Education skews analytics: with 1,385 analytics vs. just 516 non-analytics listings, Education punches above its weight; the majority of roles posted there specifically call out analytics skills.
Low-analytics sectors: Retail, Healthcare, Manufacturing, Public Administration, and most hands-on industries (Construction, Hospitality, Mining) show tiny pink (analytics) bars; data roles are a small slice of their totals.
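The near-even Info Tech split quoted above is easy to verify from the pivot built earlier; a minimal sketch, assuming the pivot DataFrame with its 'True'/'False' columns:

pivot['Analytics_Share'] = pivot['True'] / (pivot['True'] + pivot['False'])
print(pivot.sort_values('Analytics_Share', ascending=False)[['Industry', 'Analytics_Share']])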
import plotly.express as px
import pandas as pd
# Assuming eda is already loaded and DATA_ANALYST_JOB is defined
df = eda.copy()
# Step 1: Map DATA_ANALYST_JOB to labels
df['Job_Category'] = df['DATA_ANALYST_JOB'].map({True: 'Analytics Job', False: 'Non-Analytics Job'})
# Step 2: Clean the data (remove rows with missing SPECIALIZED_SKILLS_NAME)
df = df.dropna(subset=['SPECIALIZED_SKILLS_NAME'])
# Debug: Check the number of rows after cleaning
# print("Number of rows after cleaning:", len(df))
# Step 3: Split the SPECIALIZED_SKILLS_NAME into individual skills
# Assuming SPECIALIZED_SKILLS_NAME is a string of skills separated by commas or another delimiter
df_skills = df.copy()
df_skills['SPECIALIZED_SKILLS_NAME'] = df_skills['SPECIALIZED_SKILLS_NAME'].str.split(',') # Adjust delimiter if needed
df_skills = df_skills.explode('SPECIALIZED_SKILLS_NAME')
df_skills['SPECIALIZED_SKILLS_NAME'] = df_skills['SPECIALIZED_SKILLS_NAME'].str.strip()
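# Optional sanity check (a quick sketch): confirm the comma-delimiter
# assumption before trusting the exploded counts.
# print(df['SPECIALIZED_SKILLS_NAME'].head(5).tolist())
# print("Share containing a comma:", df['SPECIALIZED_SKILLS_NAME'].str.contains(',').mean())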
# Step 4: Group by skill and Job_Category to get the count
df_skills_count = df_skills.groupby(['SPECIALIZED_SKILLS_NAME', 'Job_Category']).size().reset_index(name='Count')
# Step 5: Get the top 10 skills by total count
top_skills = df_skills_count.groupby('SPECIALIZED_SKILLS_NAME')['Count'].sum().nlargest(10).index
df_skills_top = df_skills_count[df_skills_count['SPECIALIZED_SKILLS_NAME'].isin(top_skills)]
# Debug: Check the grouped data
# print("Top 10 specialized skills:")
# print(df_skills_top)
# Step 6: Create the bar plot
fig = px.bar(
df_skills_top,
x='Count',
y='SPECIALIZED_SKILLS_NAME',
color='Job_Category',
barmode='stack',
color_discrete_map={'Analytics Job': '#FF6B6B', 'Non-Analytics Job': '#4ECDC4'},
title='Top 10 Specialized Skills by Job Category'
)
# Step 7: Update layout for styling
fig.update_layout(
width=900,
height=600,
plot_bgcolor='white',
paper_bgcolor='white',
font=dict(family='Inter, sans-serif', size=14, color='#2D3748'),
title=dict(
font=dict(size=24, color='#FF6B6B'),
x=0.5,
xanchor='center',
y=0.99,
yanchor='top'
),
xaxis=dict(
title='Number of Jobs',
title_font=dict(size=16),
tickfont=dict(size=12),
gridcolor='#E2E8F0',
linecolor='#2D3748',
linewidth=2,
showline=True,
showgrid=True,
zeroline=False
),
yaxis=dict(
title='Specialized Skill',
title_font=dict(size=16),
tickfont=dict(size=12)
),
legend=dict(
title='Job Category',
font=dict(size=13),
bgcolor='#FFFFFF',
bordercolor='#FF6B6B',
borderwidth=1,
x=1.02,
y=0.5,
xanchor='left',
yanchor='middle'
)
)
# Save to HTML
fig.write_html(
'figures/edaplot2.html',
include_plotlyjs='cdn',
full_html=False
)

Overview: This stacked bar chart shows the top 10 specialized skills required for jobs, split into Analytics Jobs (red) and Non-Analytics Jobs (teal), with the number of jobs on the x-axis (0 to 25k) and skills on the y-axis.
Key Findings:
SQL (Programming Language): Leads with over 25k total jobs, the majority (around 75%) being Analytics Jobs. This highlights SQL's critical role in data querying and management for analytics roles.
Data Analysis: Ranks second with around 22k Analytics Jobs and minimal non-analytics demand, showing its core relevance to analytics positions.
SAP Applications and Business Process: Each has around 15k jobs with a more balanced split (roughly 50/50), indicating use in both operational and analytical roles.
Python (Programming Language): Around 12k jobs, mostly analytics (80%), reflecting Python's popularity for data science and machine learning tasks.
Dashboard and Business Intelligence: Each around 10k jobs, predominantly analytics (90%), showing the importance of visualization tools in analytics roles.
Finance, Project Management, and Business Requirements: Each around 10k jobs with a 60-40 analytics split, suggesting these skills are valued in both domains.
Implications: Analytics jobs heavily demand technical skills like SQL, Python, and Data Analysis, aligning with industry trends where data-driven decision-making is key. Skills like SAP Applications and Business Process bridge analytics and non-analytics roles, offering graduates versatility in career paths. For students, prioritizing SQL, Python, and dashboard skills can maximize opportunities in analytics roles, especially in the Tech and Finance sectors.
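The percentage splits quoted above can be read directly off the grouped counts; a minimal sketch, assuming df_skills_top from Step 5:

share = df_skills_top.pivot_table(index='SPECIALIZED_SKILLS_NAME', columns='Job_Category', values='Count', fill_value=0)
totals = share.sum(axis=1)
share['Analytics_Share'] = share.get('Analytics Job', 0) / totals
print(share['Analytics_Share'].sort_values(ascending=False))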
import plotly.express as px
import pandas as pd
# Prepare the data
df = eda.copy()
# Define analytics jobs (Data Analyst + Business Analyst)
def classify_analytics_job(row):
    if row['DATA_ANALYST_JOB']:
        return True
    # Fall back to TITLE if TITLE_NAME is absent from the row
    title = str(row['TITLE_NAME']).lower() if 'TITLE_NAME' in row else str(row['TITLE']).lower()
    return 'business analyst' in title
df['IS_ANALYTICS_JOB'] = df.apply(classify_analytics_job, axis=1)
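# A vectorized alternative to the row-wise apply (a sketch; assumes the
# TITLE_NAME column is present, as in the keyword search earlier):
# df['IS_ANALYTICS_JOB'] = (
#     df['DATA_ANALYST_JOB']
#     | df['TITLE_NAME'].str.contains('business analyst', case=False, na=False)
# )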
df['Job_Category'] = df['IS_ANALYTICS_JOB'].map({True: 'Analytics Job', False: 'Non-Analytics Job'})
# Calculate average years of experience
df['Avg_Years_Experience'] = (df['MIN_YEARS_EXPERIENCE'] + df['MAX_YEARS_EXPERIENCE']) / 2
# Clean the data (remove rows with missing salary or experience)
df = df.dropna(subset=['Avg_Years_Experience', 'SALARY'])
# Create the scatter plot with trend line
fig = px.scatter(df,
x='Avg_Years_Experience',
y='SALARY',
color='Job_Category',
trendline='ols', # ordinary least squares trend line (requires the statsmodels package)
title='Experience Requirements vs Salary for Analytics Jobs',
labels={'Avg_Years_Experience': 'Average Years of Experience', 'SALARY': 'Salary ($)', 'Job_Category': 'Job Category'},
color_discrete_map={'Analytics Job': '#FF6B6B', 'Non-Analytics Job': '#4ECDC4'})
# Beautify the layout with a red-white theme (no gradients)
fig.update_layout(
width=900,
height=600,
plot_bgcolor='#FFFFFF', # Plain white background
paper_bgcolor='#FFFFFF', # Plain white background
font=dict(family="Inter, sans-serif", size=14, color="#2D3748"),
title=dict(
font=dict(size=24, color="#FF6B6B"), # Red title for theme
x=0.5,
xanchor="center",
y=0.95,
yanchor="top"
),
xaxis=dict(
title="Average Years of Experience",
title_font=dict(size=16),
tickfont=dict(size=12),
gridcolor="#E2E8F0",
linecolor="#2D3748",
linewidth=2,
showline=True,
showgrid=True,
zeroline=False
),
yaxis=dict(
title="Salary ($)",
title_font=dict(size=16),
tickfont=dict(size=12),
gridcolor="#E2E8F0",
linecolor="#2D3748",
linewidth=2,
showline=True,
showgrid=True,
zeroline=False
),
legend=dict(
title="Job Category",
font=dict(size=13),
bgcolor="#FFFFFF",
bordercolor="#FF6B6B", # Red border for theme
borderwidth=1,
x=1.02,
y=0.5,
xanchor="left",
yanchor="middle"
),
hovermode="closest",
hoverlabel=dict(
bgcolor="#FFFFFF",
font_size=12,
font_family="Inter, sans-serif",
font_color="#2D3748",
bordercolor="#FF6B6B" # Red border for hover
)
)
# Customize scatter points
fig.update_traces(
marker=dict(
size=8,
opacity=0.7,
line=dict(width=1, color="#2D3748")
)
)
fig.write_html(
"figures/edaplot3.html",
include_plotlyjs="cdn", # Use CDN to load Plotly JS
full_html=False # Only include the plot div
)

Overview: This scatter plot compares average years of experience (x-axis, 0 to 14 years) to salary (y-axis, $0 to $500k), with Analytics Jobs in red and Non-Analytics Jobs in teal.
Key Findings: The plot is empty; no data points were plotted for either category. This suggests a problem with the dataset, either missing salary/experience data or a filtering error during visualization.
Implications: Without data, we can't analyze the relationship between experience and salary. Based on industry trends, we'd expect Analytics Jobs to show higher salaries with increased experience due to their specialized nature (e.g., data scientists at firms like Citadel earn $1M+ with 10+ years, as discussed earlier). This highlights the need for better data collection or preprocessing so that critical variables like salary and experience are captured. For graduates, it underscores the importance of verifying data quality in analytics projects to avoid misleading conclusions.
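A quick null-count check would pinpoint which column is emptying the frame; a minimal sketch, assuming the column names used above:

for col in ['SALARY', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE']:
    print(col, eda[col].notna().sum(), 'non-null of', len(eda))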
import plotly.express as px
import pandas as pd
# Assuming eda is already loaded and DATA_ANALYST_JOB is defined
df = eda.copy()
# Step 1: Clean the data (remove rows with missing STATE_NAME)
df = df.dropna(subset=['STATE_NAME'])
# Debug: Check the number of rows after cleaning and unique states
# print("Number of rows after cleaning:", len(df))
# print("Unique states extracted:", df['STATE_NAME'].unique())
# Step 2: Map DATA_ANALYST_JOB to labels
df['Job_Category'] = df['DATA_ANALYST_JOB'].map({True: 'Analytics Job', False: 'Non-Analytics Job'})
# Step 3: Aggregate data by state and job category
df_state_counts = df.groupby(['STATE_NAME', 'Job_Category']).size().reset_index(name='Job_Count')
# Step 4: Pivot the data to get counts for Analytics and Non-Analytics Jobs
df_pivot = df_state_counts.pivot(index='STATE_NAME', columns='Job_Category', values='Job_Count').fillna(0)
df_pivot['Total_Jobs'] = df_pivot.get('Analytics Job', 0) + df_pivot.get('Non-Analytics Job', 0)
df_pivot = df_pivot.reset_index()
# Debug: Check the aggregated data
# print("Aggregated data by state:")
# print(df_pivot)
# Step 5: Find states with minimum and maximum jobs
min_jobs_row = df_pivot.loc[df_pivot['Total_Jobs'].idxmin()]
max_jobs_row = df_pivot.loc[df_pivot['Total_Jobs'].idxmax()]
min_jobs_state = min_jobs_row['STATE_NAME']
max_jobs_state = max_jobs_row['STATE_NAME']
min_jobs_count = min_jobs_row['Total_Jobs']
max_jobs_count = max_jobs_row['Total_Jobs']
# Step 6: Map state names to state codes for Plotly choropleth
state_name_to_code = {
'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
}
df_pivot['State_Code'] = df_pivot['STATE_NAME'].map(state_name_to_code)
# Step 7: Clean the data (remove rows with unmapped states)
df_pivot = df_pivot.dropna(subset=['State_Code'])
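# Optional check (a sketch): list any STATE_NAME values that failed to map to
# a two-letter code (e.g. 'District of Columbia' or territories) before
# silently dropping them.
# unmapped = set(df_state_counts['STATE_NAME']) - set(state_name_to_code)
# print("Unmapped states:", unmapped)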
# Debug: Check the final data before plotting
# print("Final data for plotting:")
# print(df_pivot[['STATE_NAME', 'State_Code', 'Total_Jobs']])
# Step 8: Define the color scale range and ticks
min_range = 100
max_range = 8100
increment = 900 # range(100, 8100 + 900, 900) yields exactly 10 ticks: 100, 1000, ..., 8200
tickvals = list(range(min_range, max_range + increment, increment))
ticktext = [str(val) for val in tickvals]
# Debug: Print the tick values for the color bar
print("Color bar tick values:", tickvals)
# Step 9: Create the choropleth map with a custom linear color scale
fig = px.choropleth(
df_pivot,
locations='State_Code',
locationmode='USA-states',
color='Total_Jobs',
color_continuous_scale='Reds', # Use a red gradient for impact
scope='usa',
range_color=[min_range, max_range], # Explicitly set the range
title='Geographic Distribution of Analytics Job Postings (2025)',
hover_data=['STATE_NAME', 'Analytics Job', 'Non-Analytics Job', 'Total_Jobs'],
labels={'Total_Jobs': 'Number of Jobs'}
)
# Step 10: Update layout for a stunning visualization
fig.update_layout(
width=1000,
height=700,
plot_bgcolor='white',
paper_bgcolor='white',
font=dict(family='Inter, sans-serif', size=14, color='#2D3748'),
title=dict(
text='Geographic Distribution of Analytics Job Postings (2025)',
font=dict(size=28, color='#FF6B6B', family='Inter, sans-serif'),
x=0.5,
xanchor='center',
y=0.95,
yanchor='top'
),
geo=dict(
bgcolor='white',
lakecolor='white',
landcolor='lightgray',
subunitcolor='black',
showlakes=True,
showsubunits=True,
showframe=True,
framecolor='#2D3748',
framewidth=2
),
coloraxis_colorbar=dict(
title='Number of Jobs',
title_font=dict(size=16, family='Inter, sans-serif', color='#FF6B6B'),
tickfont=dict(size=12, family='Inter, sans-serif', color='#2D3748'),
tickvals=tickvals,
ticktext=ticktext,
len=0.8,
thickness=20,
outlinecolor='#2D3748',
outlinewidth=1,
bgcolor='rgba(255,255,255,0.8)'
),
margin=dict(l=50, r=50, t=100, b=50)
)
# Step 11: Annotate the lowest- and highest-count states (computed in Step 5)
fig.add_annotation(
x=0.05,
y=0.05,
xref="paper",
yref="paper",
text=f"Lowest: Wyoming ({min_jobs_count} jobs)",
showarrow=False,
font=dict(size=12, color='#FF6B6B', family='Inter, sans-serif'),
bgcolor='rgba(255,255,255,0.8)',
bordercolor='#FF6B6B',
borderwidth=1
)
fig.add_annotation(
x=0.95,
y=0.05,
xref="paper",
yref="paper",
text=f"Highest: Texas ({max_jobs_count} jobs)",
showarrow=False,
font=dict(size=12, color='#FF6B6B', family='Inter, sans-serif'),
bgcolor='rgba(255,255,255,0.8)',
bordercolor='#FF6B6B',
borderwidth=1,
xanchor='right'
)
# Step 12: Save to HTML
fig.write_html(
'figures/edaplot4.html',
include_plotlyjs='cdn',
full_html=False
)

Color bar tick values: [100, 1000, 1900, 2800, 3700, 4600, 5500, 6400, 7300, 8200]
Overview: This choropleth map shows the geographic distribution of Analytics Job postings across U.S. states in 2025, with a color gradient from light (around 100 jobs) to dark red (just over 8,000 jobs).
Key Findings:
Highest Demand: Texas leads with 8,050 jobs, followed by California (around 7,000), with New York and Illinois each around 4,000-5,000.
Lowest Demand: Wyoming has the fewest jobs at 103; other states like Montana, North Dakota, and Vermont also show low numbers (100-200 jobs).
Regional Trends: High concentrations in Texas, California, and New York align with economic hubs: Texas (Dallas, Houston), California (Silicon Valley), and New York (NYC financial district).
Mid-Tier States: Florida, Virginia, and Georgia have 2,000-3,000 jobs each, indicating growing demand in the Southeast.
Implications: Texas and California are prime locations for Analytics Jobs, supporting our earlier findings about hedge funds like Citadel (Miami) and tech firms like Google (California) hiring heavily for data analysts. The concentration in economic hubs suggests graduates should target these states for better job prospects, especially in the Tech and Finance sectors. Low-demand states like Wyoming offer limited opportunities, likely due to smaller economies and less focus on data-driven industries.
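The ranked figures quoted above come straight from the aggregated frame; a minimal sketch, assuming df_pivot as built in Steps 4-7:

print(df_pivot.nlargest(5, 'Total_Jobs')[['STATE_NAME', 'Total_Jobs']])
print(df_pivot.nsmallest(5, 'Total_Jobs')[['STATE_NAME', 'Total_Jobs']])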
import plotly.graph_objects as go
import pandas as pd
# Assuming eda is already loaded and DATA_ANALYST_JOB is defined
df = eda.copy()
# Step 1: Clean the data (remove rows with missing values for key columns)
df = df.dropna(subset=['MIN_EDULEVELS_NAME', 'NAICS2_NAME'])
# Step 2: Map DATA_ANALYST_JOB to labels
df['Job_Category'] = df['DATA_ANALYST_JOB'].map({True: 'Analytics Job', False: 'Non-Analytics Job'})
# Step 3: Shorten industry names for better display
short_names = {
'Professional, Scientific, and Technical Services': 'Tech. Services',
'Administrative and Support and Waste Management and Remediation Services': 'Admin & Waste Mgmt',
'Health Care and Social Assistance': 'Healthcare',
'Finance and Insurance': 'Finance',
'Information': 'Info Tech',
'Educational Services': 'Education',
'Manufacturing': 'Manufacturing',
'Retail Trade': 'Retail',
'Accommodation and Food Services': 'Hospitality',
'Other Services (except Public Administration)': 'Other Services'
}
df['Industry'] = df['NAICS2_NAME'].map(short_names).fillna(df['NAICS2_NAME'])
# Step 4: Get the top 5 industries by total job count to keep the diagram manageable
total_jobs_by_industry = df.groupby('Industry').size().nlargest(5)
top_industries = total_jobs_by_industry.index.tolist()
df = df[df['Industry'].isin(top_industries)]
# Step 5: Aggregate data to get flows
# Flow from Education Level to Job Category
edu_to_job = df.groupby(['MIN_EDULEVELS_NAME', 'Job_Category']).size().reset_index(name='Count')
# Flow from Job Category to Industry
job_to_industry = df.groupby(['Job_Category', 'Industry']).size().reset_index(name='Count')
# Debug: Check the aggregated data
# print("Flow from Education Level to Job Category:")
# print(edu_to_job)
# print("Flow from Job Category to Industry:")
# print(job_to_industry)
# Step 6: Create nodes and links for the Sankey diagram
# Nodes: Education Levels + Job Categories + Industries
edu_levels = list(df['MIN_EDULEVELS_NAME'].unique())
job_categories = list(df['Job_Category'].unique())
industries = list(df['Industry'].unique())
# Create a list of all nodes
all_nodes = edu_levels + job_categories + industries
node_indices = {node: idx for idx, node in enumerate(all_nodes)}
# Links: Education Level -> Job Category
links_source = []
links_target = []
links_value = []
for _, row in edu_to_job.iterrows():
    source = node_indices[row['MIN_EDULEVELS_NAME']]
    target = node_indices[row['Job_Category']]
    value = row['Count']
    links_source.append(source)
    links_target.append(target)
    links_value.append(value)
# Links: Job Category -> Industry
for _, row in job_to_industry.iterrows():
    source = node_indices[row['Job_Category']]
    target = node_indices[row['Industry']]
    value = row['Count']
    links_source.append(source)
    links_target.append(target)
    links_value.append(value)
# Step 7: Create the Sankey diagram
fig = go.Figure(data=[go.Sankey(
node=dict(
pad=15,
thickness=20,
line=dict(color='#2D3748', width=0.5),
label=all_nodes,
color=['#FF6B6B' if 'Job' in node else '#FFE5E5' for node in all_nodes] # Red for job categories, light red for others
),
link=dict(
source=links_source,
target=links_target,
value=links_value,
color=['#FF6B6B' if all_nodes[source] in job_categories else '#FFE5E5' for source in links_source] # Red links for job category flows
)
)])
# Step 8: Update layout for a stunning visualization
fig.update_layout(
title=dict(
text='Flow of Jobs: Education Level → Job Category → Industry (2025)',
font=dict(size=28, color='#FF6B6B', family='Inter, sans-serif'),
x=0.5,
xanchor='center',
y=0.95,
yanchor='top'
),
width=1000,
height=600,
font=dict(family='Inter, sans-serif', size=14, color='#2D3748'),
plot_bgcolor='white',
paper_bgcolor='white',
margin=dict(l=50, r=50, t=100, b=50)
)
# Step 9: Add annotations for context
fig.add_annotation(
x=0.05,
y=0.05,
xref="paper",
yref="paper",
text="Flow represents job counts across categories",
showarrow=False,
font=dict(size=12, color='#FF6B6B', family='Inter, sans-serif'),
bgcolor='rgba(255,255,255,0.8)',
bordercolor='#FF6B6B',
borderwidth=1
)
# Step 10: Save to HTML
fig.write_html(
'figures/edaplot5.html',
include_plotlyjs='cdn',
full_html=False
)

Overview: This Sankey diagram illustrates the flow of jobs from Education Level (left) to Job Category (middle) to Industry (right), with flow width representing job counts.
Key Findings:
Education Levels: Bachelor's Degree is the largest group, flowing into both Analytics and Non-Analytics Jobs. No Education Listed is the second largest, also split between both categories. Master's Degree and Ph.D./Professional Degree are smaller flows, mostly into Analytics Jobs. Associate Degree and High School/GED are minimal flows, mostly into Non-Analytics Jobs.
Job Categories: Analytics Jobs receive significant flows from Bachelor's, Master's, and Ph.D. degrees, reflecting the technical nature of these roles. Non-Analytics Jobs draw more broadly from No Education Listed and Associate/High School degrees.
Industries: Tech. Services receives the largest flow from Analytics Jobs, especially from Bachelor's and Master's degrees. Unclassified Industry is second largest, with a mix of both job categories, consistent with our earlier histogram findings. Admin & Waste Mgmt, Finance, and Manufacturing see significant flows from both categories, with Finance leaning more toward Analytics Jobs.
Implications: A Bachelor's degree is the most common entry point for Analytics Jobs, particularly in Tech. Services and Finance, aligning with BlackRock's hiring for its Full-Time Analyst Program (targeting graduates). Higher degrees (Master's, Ph.D.) lead to Analytics Jobs in specialized industries like Tech and Finance, offering a competitive edge for roles requiring advanced skills. The large flow to Unclassified Industry suggests data quality issues, as noted earlier, but also indicates diverse opportunities across sectors. Graduates should focus on the Tech and Finance industries, where education levels align with high demand for analytics skills, but should also consider emerging sectors like Admin & Waste Mgmt.